library(tidyverse) # for graphing and data cleaning
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
theme_set(theme_minimal()) # My favorite ggplot() theme :)
# Lisa's garden data
data("garden_harvest")
# Seeds/plants (and other garden supply) costs
data("garden_spending")
# Planting dates and locations
data("garden_planting")
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
Summarize the garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week (HINT: use the wday() function from lubridate). Display the results so that the vegetables are rows but the days of the week are columns.
garden_harvest %>%
mutate(day = wday(date, label = TRUE)) %>%
group_by(vegetable, day) %>%
mutate(weight_lbs=weight*0.00220462) %>%
summarize(total_weight_lbs = sum(weight_lbs)) %>%
pivot_wider(id_cols = vegetable,
names_from = day,
values_from = total_weight_lbs)
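The same summarize-then-widen pattern can be checked on a few rows where the answer is easy to verify by hand. This is only a sketch: `toy_harvest` is a hypothetical stand-in for garden_harvest, and its weights are assumed to be in grams, consistent with the conversion factor used above.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical toy data (not from gardenR); weight is assumed to be in grams
toy_harvest <- tibble(
  vegetable = c("beans", "beans", "kale", "kale"),
  date      = ymd(c("2020-07-06", "2020-07-13", "2020-07-06", "2020-07-07")),
  weight    = c(1000, 500, 200, 300)
)

toy_wide <- toy_harvest %>%
  mutate(day = wday(date, label = TRUE),       # day of week as an ordered factor
         weight_lbs = weight * 0.00220462) %>% # grams to pounds
  group_by(vegetable, day) %>%
  summarize(total_weight_lbs = sum(weight_lbs), .groups = "drop") %>%
  pivot_wider(id_cols = vegetable,
              names_from = day,
              values_from = total_weight_lbs)

toy_wide
```

Both bean harvests fall on the same day of the week, so they collapse into one cell. Kale was harvested on two different days, so it fills two cells, and beans get an NA in the day they were never harvested.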
Summarize the garden_harvest data to find the total harvest in pounds for each vegetable variety and then try adding the plot from the garden_planting table. This will not turn out perfectly. What is the problem? How might you fix it?
garden_harvest %>%
group_by(vegetable, variety) %>%
summarize(total_weight_lbs = sum((weight)*0.00220462)) %>%
left_join(garden_planting,
by = c("vegetable", "variety"))%>%
select(vegetable, variety, total_weight_lbs, plot)
The problem appears to be missing data: for some harvested plants, garden_planting has no matching vegetable and variety, so the plot comes back as NA after the join. One fix is to drop those rows from the result.
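One way to diagnose a join like this is to look separately for keys with no match and for keys that match more than once. The sketch below uses two hypothetical toy tables (`toy_totals`, `toy_planting`) with made-up values. Whether the real tables show both effects is something to check rather than assume.

```r
library(dplyr)

# Hypothetical stand-ins for the harvest totals and garden_planting
toy_totals <- tibble(
  vegetable        = c("tomatoes", "lettuce"),
  variety          = c("Amish Paste", "reseed"),
  total_weight_lbs = c(65.7, 0.1)   # made-up numbers
)

toy_planting <- tibble(
  vegetable = c("tomatoes", "tomatoes"),
  variety   = c("Amish Paste", "Amish Paste"),
  plot      = c("J", "N")           # same variety planted in two plots
)

joined <- toy_totals %>%
  left_join(toy_planting, by = c("vegetable", "variety"))

# Harvested varieties with no planting record (plot would be NA)
unmatched <- toy_totals %>%
  anti_join(toy_planting, by = c("vegetable", "variety"))

# Varieties planted in more than one plot (their totals repeat after the join)
multi_plot <- toy_planting %>%
  count(vegetable, variety) %>%
  filter(n > 1)

joined
```

In this toy case the join returns three rows from two totals: the tomato total appears once per plot, and the lettuce row has a missing plot. Dropping NA rows handles the second problem but not the first.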
Use the garden_harvest and garden_spending datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.
Using the garden_harvest data, we can find the total weight of each harvested vegetable. We then use the Whole Foods Market data to calculate how much that same amount of vegetables would cost to buy in-store. From that potential spending, we subtract the amount actually spent on seeds and materials (from garden_spending) to see how much was saved. This of course doesn’t take labor or time into account.
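The steps described in words can be sketched with two left joins. Everything here is hypothetical: the table names, the `price_per_lb` and `spent` columns, and all the numbers are made up for illustration and do not come from the garden data or any store.

```r
library(dplyr)

# All values below are invented for illustration only
toy_harvest_lbs <- tibble(vegetable = c("tomatoes", "beans"),
                          total_lbs = c(10, 4))
toy_prices      <- tibble(vegetable = c("tomatoes", "beans"),
                          price_per_lb = c(3, 2.5))
toy_spending    <- tibble(vegetable = c("tomatoes", "beans"),
                          spent = c(8, 5))

savings <- toy_harvest_lbs %>%
  left_join(toy_prices,   by = "vegetable") %>%  # attach store price per pound
  left_join(toy_spending, by = "vegetable") %>%  # attach what was spent
  mutate(store_value = total_lbs * price_per_lb,
         saved       = store_value - spent)

total_saved <- sum(savings$saved)
total_saved
```

With real data, the join keys would probably need cleaning first, since vegetable names in a store price list are unlikely to match the garden tables exactly.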
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
mutate(weight_lbs=weight*0.00220462) %>%
arrange(date) %>%
ggplot(aes(y=weight_lbs, x=date))+
geom_col()+
facet_wrap(~fct_reorder(factor(variety), date))+
labs(title = "Total Harvest of tomatoes in Pounds",
x="Month",
y="Pounds")
In the garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().
garden_harvest %>%
select(vegetable,variety) %>%
mutate(lowercase= str_to_lower(variety),
length=str_length(variety)) %>%
arrange(vegetable,length) %>%
distinct()
In the garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().
garden_harvest %>%
filter(str_detect(variety, "er|ar")) %>%
distinct(vegetable, variety)
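To see how str_detect() behaves with the `|` alternation, it helps to try it on a handful of names where the matches are obvious. The tibble below is a hypothetical toy example, not the real garden_harvest table.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows; the duplicated first/last row mimics repeated harvests
toy_varieties <- tibble(
  vegetable = c("beans", "beans", "lettuce", "peas", "beans"),
  variety   = c("Bush Bush Slender", "Chinese Red Noodle",
                "Farmer's Market Blend", "Magnolia Blossom",
                "Bush Bush Slender")
)

er_ar <- toy_varieties %>%
  filter(str_detect(variety, "er|ar")) %>%  # keep names containing "er" or "ar"
  distinct(vegetable, variety)              # one row per vegetable variety

er_ar
```

The pattern is case-sensitive and matches anywhere in the string, so "Slender" and "Farmer's" qualify while "Chinese Red Noodle" does not.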
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
Trips contains records of individual rentals.
Stations gives the locations of the bike rental stations.
Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
A density plot of the events versus sdate. Use geom_density().
Trips %>%
ggplot(aes(x = sdate)) +
geom_density() +
labs(title="Rental Density by Date", x="Date", y="Density")
This plot shows that the density of rentals is highest in October. The density declines in November (probably because of the colder weather), rises slightly in December, and drops again going into January.
A density plot of the events versus time of day. Use mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time = hour + min) %>%
ggplot(aes(x = time))+
geom_density()+
labs(title="Rental Density by Time of Day", x="Time of Day", y="Density")
This graph shows that the most popular times of day to rent a bike are around 8-9 in the morning and 5-6 in the evening. This is probably due to the typical 9-to-5 workday: people rent bikes to get to and from work.
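The decimal time-of-day conversion from the hint can be verified on a couple of timestamps. The two date-times below are hypothetical values chosen so the expected results are easy to compute by hand.

```r
library(lubridate)

# Hypothetical timestamps: 3:30 and 15:45
sdate_example <- ymd_hms(c("2014-10-15 03:30:00", "2014-10-15 15:45:00"))

# Hour of day plus the minute expressed as a fraction of an hour
time_of_day <- hour(sdate_example) + minute(sdate_example) / 60

time_of_day  # 3.50 and 15.75
```

Seconds are ignored here, as in the exercise code. Adding `second(sdate_example) / 3600` would include them if that precision mattered.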
Trips %>%
mutate(day = wday(sdate, label=TRUE)) %>%
ggplot(aes(y = day))+
geom_bar()+
labs(title = "Rentals Compared by Day",
x="Rentals",
y="")
This shows that rentals are highest on Mondays and Fridays.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time = hour + min,
day=wday(sdate,label=TRUE))%>%
ggplot(aes(x = time))+
geom_density()+
facet_wrap(~day, scales = "free") +
labs(title="Rental Density by Time of Day", x="Time of Day", y="Density")
Monday through Friday, bikes tend to be rented during rush hours, when people are getting to and from work. On Saturdays and Sundays, however, bikes seem to be rented mostly from 1-3 pm, which suggests fun or leisure use.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Casual). The next set of exercises investigates whether these two categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises.
Set the fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time = hour + min,
day=wday(sdate,label=TRUE))%>%
ggplot(aes(x=time, fill=client))+
geom_density(alpha=0.5, color=NA) +
facet_wrap(~day, scales = "free") +
labs(title="Rental Density by Time of Day", x="Time of Day", y="Density")
This graph shows us that Casual clients on average rent bikes around 12-3 pm. The Registered clients behave differently depending on whether it’s the weekend. On weekdays they seem to rent bikes as a means of getting to and from work; on weekends they behave more like the Casual clients.
Add the argument position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time = hour + min,
day=wday(sdate,label=TRUE))%>%
ggplot(aes(x=time, fill=client))+
geom_density(alpha=0.5, color=NA, position = position_stack()) +
facet_wrap(~day, scales = "free") +
labs(title="Rental Density by Time of Day", x="Time of Day", y="Density")
I think stacking is useful: you can see the overall trend as well as the trends of the two types of clients. However, this method is particularly bad at showing the casual clients’ behavior. The outline of the graph follows total rentals, and since casual is stacked on top of registered, it’s hard to see what casual clients are really doing.
Consider the earlier graph (the one without position = position_stack()). Add a new variable to the dataset called weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time = hour + min,
day=wday(sdate,label=TRUE),
weekend = ifelse(day %in% c("Sat","Sun"), "Weekend", "Weekday"))%>%
ggplot(aes(x=time, fill=client))+
geom_density(alpha=0.5, color=NA) +
facet_wrap(~weekend, scales = "free") +
labs(title="Rental Density by Time of Day", x="Time of Day", y="Density")
In this plot we can see the same patterns mentioned before, but we can also see the density differences between weekend and weekday. We can see that on the weekend, casual clients rent more than on a weekday.
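The weekend flag above compares day labels such as "Sat" and "Sun", which assumes English day labels. A possible alternative, sketched here on a few hypothetical dates, is to compare the numeric day of week instead. With `week_start = 7`, wday() returns 1 for Sunday and 7 for Saturday.

```r
library(lubridate)

# Hypothetical dates: Friday through Monday in early October 2014
dates_example <- ymd(c("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06"))

# Numeric day of week with Sunday = 1 and Saturday = 7
day_num <- wday(dates_example, week_start = 7)

weekend_flag <- ifelse(day_num %in% c(1, 7), "weekend", "weekday")

weekend_flag  # "weekday" "weekend" "weekend" "weekday"
```

This is a swapped-in variant rather than the approach used in the exercise. Both give the same result when the session uses English labels.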
Change the graph from the previous problem to facet on client and fill with weekend. What information does this graph tell you that the previous didn’t? Is one graph better than the other?
Trips %>%
mutate(hour = hour(sdate),
min = (minute(sdate))/60,
time = hour + min,
day=wday(sdate,label=TRUE),
weekend = ifelse(day %in% c("Sat","Sun"), "Weekend", "Weekday"))%>%
ggplot(aes(x=time, fill=weekend))+
geom_density(alpha=0.5, color=NA) +
facet_wrap(~client, scales = "free") +
labs(title="Rental Density by Time of Day", x="Time of Day", y="Density")
This shows us that casual clients have the same time-of-day trends regardless of day; however, more of them rent on weekends. This graphic also shows that registered clients’ behavior changes depending on whether it’s a weekend or a weekday (as mentioned in previous problems).
### Spatial patterns
Use the latitude and longitude variables in Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!
Trips %>%
left_join(Stations,by=c("sstation" = "name")) %>%
group_by(lat, long) %>%
summarise(departures = n()) %>%
ggplot(aes(y = lat,x = long, color = departures))+
geom_point()+
labs(title="Departures From Each Station", x="Longitude", y="Latitude")
From this longitude/latitude plot, we can see there is an obvious center where most departures happen. It is around longitude -77.04 and latitude 38.9.
Trips %>%
left_join(Stations,by=c("sstation"="name")) %>%
group_by(lat, long) %>%
summarise(prop=sum(client=="Casual")/n()) %>%
ggplot(aes(y=lat,x=long, color=prop))+
geom_point()+
labs(title="Proportion of Departures From Each Station that are Casual Clients", x="Longitude", y="Latitude", color="Proportion of Casual Users")
This plot tells us that the center I mentioned before is made up of far fewer casual clients than registered clients.
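The per-station proportion in the code above is `sum(client == "Casual") / n()`, which is the same as taking the mean of the logical vector. A small sketch on hypothetical toy trips (the stations "A" and "B" are made up) makes the arithmetic easy to check.

```r
library(dplyr)

# Hypothetical trips: station A has 1 casual out of 4, station B has 2 out of 2
toy_trips <- tibble(
  sstation = c("A", "A", "A", "A", "B", "B"),
  client   = c("Casual", "Registered", "Registered", "Registered",
               "Casual", "Casual")
)

station_summary <- toy_trips %>%
  group_by(sstation) %>%
  summarise(departures  = n(),
            prop_casual = mean(client == "Casual"))  # TRUE counts as 1

station_summary
```

Station B shows why the count is worth keeping next to the proportion: a proportion of 1 from only two departures says much less than 0.25 from a busy station.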
as_date(sdate) converts sdate from date-time format to date format.
Ten_Highest <- Trips %>%
mutate(date = as_date(sdate)) %>%
group_by(sstation,date) %>%
summarise(num_departures=n()) %>%
arrange(desc(num_departures)) %>%
head(n=10)
Ten_Highest
Columbus Circle / Union Station is the most popular station to depart from.
Trips %>%
mutate(date=as_date(sdate)) %>%
inner_join(Ten_Highest, by=c("sstation","date"))
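If the goal is only to keep the matching trips, semi_join() is a possible alternative to inner_join(): it filters the left table without adding columns from the right one. The sketch below uses hypothetical toy tables (`toy_trips`, `toy_top`) in place of Trips and Ten_Highest.

```r
library(dplyr)
library(lubridate)

# Hypothetical trips with a station and a date
toy_trips <- tibble(
  sstation = c("A", "A", "B", "C"),
  date     = ymd(c("2014-10-01", "2014-10-02", "2014-10-01", "2014-10-01"))
)

# Hypothetical "top" station-date combinations
toy_top <- tibble(
  sstation       = c("A", "C"),
  date           = ymd(c("2014-10-01", "2014-10-01")),
  num_departures = c(5, 4)
)

# Keep only trips whose station and date appear in toy_top
kept <- toy_trips %>%
  semi_join(toy_top, by = c("sstation", "date"))

kept
```

The result has the same columns as `toy_trips`. An inner join would return the same rows here but would also carry along `num_departures`.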
Trips %>%
mutate(date=as_date(sdate)) %>%
inner_join(Ten_Highest, by=c("sstation","date")) %>%
mutate(weekday =(wday(sdate, label=TRUE))) %>%
group_by(client, weekday) %>%
summarise(depart=n()) %>%
group_by(client) %>%
mutate(prop = depart/sum( depart)) %>%
pivot_wider(id_cols=weekday,names_from = client, values_from = prop)
This tells us that, among the trips on these ten busiest station-days, Saturday is the biggest day for Casual clients, accounting for almost 50% of their rentals. The most popular day for Registered clients is Thursday.
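The proportions in that table are computed within each client type, so each client column sums to 1. A minimal sketch with hypothetical counts (the numbers are made up) shows the group_by() and mutate() step on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical departure counts by client type and weekday
toy_counts <- tibble(
  client  = c("Casual", "Casual", "Registered", "Registered"),
  weekday = c("Sat", "Sun", "Thu", "Fri"),
  depart  = c(3, 1, 6, 4)
)

toy_props <- toy_counts %>%
  group_by(client) %>%
  mutate(prop = depart / sum(depart)) %>%  # share within each client type
  ungroup()

toy_prop_wide <- toy_props %>%
  pivot_wider(id_cols = weekday, names_from = client, values_from = prop)

toy_prop_wide
```

Days with no departures for a client type show up as NA in the wide table rather than 0. pivot_wider() has a `values_fill` argument if zeros are preferred.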
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.